Short: V1.0 Extract URLs from any file +sort++
Uploader: frans@xfilesystem.freeserve.co.uk (francis swift)
Author: frans@xfilesystem.freeserve.co.uk (francis swift)
Type: comm/www
URL: www.xfilesystem.freeserve.co.uk
These are some quick'n'nasty hacks, but I've included the source for you to
look at, especially as urlx uses btree routines and there aren't many simple
examples of using btrees.
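
If you just want the flavour of that without reading the source: the idea is
to insert every URL found into a tree keyed on the URL string and drop any
string that is already present, so duplicates vanish for free and an in-order
walk gives sorted output. A minimal sketch of that idea, using a plain binary
search tree rather than the btree routines urlx actually uses (all names
below are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One node per unique URL, with children for ordering. */
    struct node {
        char *url;
        struct node *left, *right;
    };

    /* Insert url into the tree rooted at *rootp.
       Returns 1 if it was new, 0 if it was a duplicate. */
    static int insert(struct node **rootp, const char *url)
    {
        int cmp;

        if (*rootp == NULL) {
            struct node *n = malloc(sizeof(*n));
            if (n == NULL)
                return 0;
            n->url = strdup(url);
            n->left = n->right = NULL;
            *rootp = n;
            return 1;
        }
        cmp = strcmp(url, (*rootp)->url);
        if (cmp == 0)
            return 0;                   /* duplicate, drop it */
        return insert(cmp < 0 ? &(*rootp)->left : &(*rootp)->right, url);
    }

    /* An in-order walk prints the stored URLs in sorted order. */
    static void dump(const struct node *n, FILE *out)
    {
        if (n == NULL)
            return;
        dump(n->left, out);
        fprintf(out, "%s\n", n->url);
        dump(n->right, out);
    }

A btree does the same job but stays balanced however the URLs happen to
arrive, which matters once the input gets big.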
urlx
----
This program searches a file for URLs (http:// etc.) and prints them or
writes them to a file. Internally it stores them in a btree, which allows
duplicates to be eliminated and, optionally, the output to be sorted. There
are two sorts available: -s selects a simple alphabetic sort, and -u selects
a special URL sort that should provide better grouping of similar site names
(basically it sorts the first URL element in groups, backwards; see the
sketch below). The output can be either straight text or, by selecting -h,
HTML format for making quick bookmark files. By default any parameters after
the URL are ignored, but they can be kept with -p. You can also output just
one type of file by selecting the extension with -.ext; for example, to show
only .jpg URLs you would use -.jpg, and for .html you would use -.htm (which
matches both .htm and .html). A better solution for this last case is the -i
flag, which selects not only .html extensions but also paths where a default
HTML file would be expected.
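
The readme doesn't spell out exactly what the -u key looks like, but one
reading of "sorts first url element in groups backwards" is that the host
part is split on the dots and its groups are compared in reverse order, so
www.foo.co.uk and ftp.foo.co.uk end up next to each other. A rough sketch of
building such a key (this interpretation is a guess, not necessarily what
urlx really does):

    #include <stdio.h>
    #include <string.h>

    /* Build a sort key by reversing the dot-separated groups of a host
       name: "www.foo.co.uk" becomes "uk.co.foo.www".  Sorting on this key
       groups hosts by domain rather than by the leading "www"/"ftp".
       key must have room for strlen(host) + 1 bytes. */
    static void host_key(const char *host, char *key)
    {
        const char *end = host + strlen(host);
        const char *p = end;

        key[0] = '\0';
        while (p > host) {
            const char *dot = p;

            while (dot > host && dot[-1] != '.')
                dot--;                  /* find the start of this group */
            strncat(key, dot, (size_t)(p - dot));
            p = dot;
            if (p > host) {
                strcat(key, ".");
                p--;                    /* step over the '.' itself */
            }
        }
    }

    int main(void)
    {
        char key[64];

        host_key("www.xfilesystem.freeserve.co.uk", key);
        printf("%s\n", key);  /* prints uk.co.freeserve.xfilesystem.www */
        return 0;
    }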
Basically there are lots of options but you'll probably just end up using:
urlx -u infile outfile
which uses the special URL sort, or
urlx -u -h infile outfile.html
for making a bookmark file.
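
The -h output only has to be something a browser will load as a page of
clickable links, so presumably one anchor per URL along these lines (the
exact markup urlx writes is an assumption):

    #include <stdio.h>

    /* Write one URL as a link line of a minimal HTML bookmark page. */
    static void emit_bookmark(FILE *out, const char *url)
    {
        fprintf(out, "<a href=\"%s\">%s</a><br>\n", url, url);
    }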
treecat
-------
This is just a quick hack to let shell (sh/pdksh) users grab URLs from a
complete directory tree. urlx accepts a single dash to mean that input comes
from stdin, so you can use something like
treecat dh0:Voyager/cache | urlx -u - outfile
to produce a file containing every URL in every file in your Voyager cache.
You can use this on any browser cache tree.
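
treecat itself is not listed here, but the job it does is simple: walk the
directory tree and copy every plain file to stdout, so urlx can read the lot
from a single pipe. A portable sketch of that walk, using POSIX directory
calls rather than whatever the original hack does on the Amiga side:

    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Copy one file to stdout. */
    static void cat_file(const char *path)
    {
        char buf[8192];
        size_t n;
        FILE *f = fopen(path, "rb");

        if (f == NULL)
            return;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            fwrite(buf, 1, n, stdout);
        fclose(f);
    }

    /* Recurse through a directory tree, catting every regular file. */
    static void treecat(const char *dir)
    {
        DIR *d = opendir(dir);
        struct dirent *e;
        struct stat st;
        char path[1024];

        if (d == NULL)
            return;
        while ((e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
            if (stat(path, &st) != 0)
                continue;
            if (S_ISDIR(st.st_mode))
                treecat(path);
            else if (S_ISREG(st.st_mode))
                cat_file(path);
        }
        closedir(d);
    }

    int main(int argc, char **argv)
    {
        if (argc > 1)
            treecat(argv[1]);
        return 0;
    }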
scanv
-----
This is used specifically to pick out the URLs from the headers of the files
in a Voyager cache, i.e. the URL of each file itself; the program doesn't
look in the file contents for any other URLs. Use treecat|urlx for that.
urlv
----
This is used specifically to grab URLs from a Voyager history file, usually
called URL-History.1.
urla
----
This is used specifically to grab URLs from an AWeb cache index file,
usually called AWCR.